Document clustering based on keyword frequency and concept matching technique in Hadoop

Authors

  • R.Priyadarshini
  • Latha Tamilselvan
Abstract

The term big data has come into use in recent years. It refers to the ever-increasing amount of information that organisations store, process and analyse. Owing to the growing number of information sources, big-data-based file systems are necessary. With the explosive growth of digital information, automatic document clustering or categorization has become more important. Document management and clustering are especially important for content management systems (CMS), where web documents need to be mined and stored. This study deals with online document-based content management with automatic URL indexing. Content in a CMS is very likely to be redundant across the web, since most of it is gathered from existing websites. Backtracking to the source of such content becomes obsolete, and changes to the source are difficult to track. A document-based content management system (DCMS) is therefore essential. The documents stored in a DCMS comprise a huge amount of data, so document clustering is needed. Previous work on document clustering considers only structural information and ignores semantic information. This study proposes a novel semantic and similarity-measure-based technique that considers both the structural and the semantic information of a document. Semantic-analysis-based clustering is applied to the text documents, and a similarity measure among the documents is then devised based on machine learning algorithms using Apache Hadoop. To achieve accurate clustering and efficient retrieval, the documents are first stored in the Hadoop Distributed File System and clustered using the K-means algorithm. Clustering is then also performed using the concept matching technique, and the time taken to form the clusters is plotted and compared.
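The two clustering routes named in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation and does not use Hadoop: it assumes plain keyword counts as features, a basic Lloyd-style K-means seeded with the first k documents, and a hypothetical `concepts` dictionary that maps each concept name to a hand-picked keyword list. All function names, sample documents and concept names are illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens, and count each keyword."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vector(tf, vocabulary):
    """Project keyword counts onto a fixed vocabulary and L2-normalise the result."""
    vec = [float(tf.get(term, 0)) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Basic Lloyd-style K-means.

    Assumption: the first k vectors seed the centroids so the run is deterministic.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def match_concept(tf, concepts):
    """Return the concept whose keywords occur most often in the document, or None."""
    scores = {name: sum(tf.get(w, 0) for w in words) for name, words in concepts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical sample documents and concept dictionary (not from the paper).
docs = [
    "hadoop cluster stores data blocks on many nodes",
    "football match ended with a late goal",
    "hadoop nodes process data blocks in parallel",
    "the team scored a goal in the football final",
]
concepts = {
    "big_data": ["hadoop", "data", "nodes"],
    "sport": ["football", "goal", "team"],
}

frequencies = [term_frequencies(d) for d in docs]
vocabulary = sorted(set().union(*frequencies))
vectors = [to_vector(tf, vocabulary) for tf in frequencies]

labels = kmeans(vectors, k=2)
concept_labels = [match_concept(tf, concepts) for tf in frequencies]

print("K-means labels:  ", labels)
print("Concept matching:", concept_labels)
```

In this sketch K-means groups documents purely by overlapping keyword counts, while concept matching needs a predefined keyword list per concept but avoids the iterative centroid updates. That difference is one plausible reason the two routes could differ in cluster formation time.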

Similar resources

Document clustering based on ontology and a fuzzy approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...

Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback

Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. This paper proposes an approach for document image retrieval based on keyword spotting, in the form of a framework that uses relevance feedback. Relevance feedback, an interactive and efficient method, is used in this paper to imp...

Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing

Class labels for clustering can be generated automatically, but they are of much lower quality than labels specified by humans. If the class labels for clustering are provided, the clustering is more effective. In classic document clustering based on the vector model, documents are represented by term frequencies without considering the semantic information of each document. The property of vector model may be incorrect...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text has motivated researchers to use...

Contextual Abstraction Based Clustering Technique for Effective Text Document Mining

Document clustering is considered an essential process for grouping unlabelled documents in text mining and information retrieval applications. Recently, many research works have been developed for text document clustering. However, the performance of text document clustering is not effective. In order to overcome such limitation, a novel Contextual Abstraction based Do...

Publication date: 2014